Cluster Analysis - Haystack Data

This notebook uses the cleaned, processed data from the Step 3 - Data Augmentation Python notebook.

Load Key Libraries

library(magrittr)
library(HDclassif)
Loading required package: MASS
library(psych)
library(cluster)
library(ggplot2)

Attaching package: ‘ggplot2’

The following objects are masked from ‘package:psych’:

    %+%, alpha
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ─────────────────────────────────────────── tidyverse 1.3.2 ──
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
✔ purrr   0.3.5      
── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::%+%()     masks psych::%+%()
✖ ggplot2::alpha()   masks psych::alpha()
✖ tidyr::extract()   masks magrittr::extract()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::lag()       masks stats::lag()
✖ dplyr::select()    masks MASS::select()
✖ purrr::set_names() masks magrittr::set_names()
library(FactoMineR)
Registered S3 methods overwritten by 'htmltools':
  method               from         
  print.html           tools:rstudio
  print.shiny.tag      tools:rstudio
  print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'htmlwidgets':
  method           from         
  print.htmlwidget tools:rstudio
options(scipen=999)

This is Part 2 of the cluster analysis. We now use mixed data types to generate clusters.

Run an example of cluster analysis using FAMD

Step 1: FAMD

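# run_pca(data_frame, pca_type, components) is a helper that is not defined in this notebook.
# Here it runs FAMD on the mixed-type data and keeps 22 components.
dfl_famd <- run_pca(dfz_FAMD, "FAMD", 22)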
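
The body of `run_pca` is not shown in this notebook. As a rough sketch of what the FAMD step might look like, assuming the helper wraps `FactoMineR::FAMD` and returns the per-row coordinates, the built-in `iris` data (four numeric columns and one factor) stands in for `dfz_FAMD`. The names `mixed_df`, `famd_fit`, and `famd_coords` are illustrative only.

```r
library(FactoMineR)

# Stand-in for dfz_FAMD: any data frame mixing numeric and factor columns.
mixed_df <- iris

# ncp sets how many dimensions to keep (the notebook keeps 22);
# graph = FALSE suppresses the default plots.
famd_fit <- FAMD(mixed_df, ncp = 3, graph = FALSE)

# Coordinates of each row on the retained dimensions. These are all numeric,
# which is what lets distance-based clustering run on mixed input.
famd_coords <- as.data.frame(famd_fit$ind$coord)
dim(famd_coords)

# Eigenvalue and share of variance for the leading dimensions.
head(famd_fit$eig, 3)
```

The eigenvalue table is one way to judge whether the number of retained components is reasonable for a given data set.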

# Linkage methods and distance metrics to compare
meth_list <- list("ward.D", "centroid", "median")  # "average" currently disabled
dist_list <- list("minkowski", "maximum")          # "canberra" currently disabled
bclist <- list()

# Run fun_nc for every method/distance pair, passing 2 and 5 as the cluster-count bounds
for (m in meth_list) {
  for (d in dist_list) {
    best_cluster <- fun_nc(dfl_famd, d, 2, 5, m)
    bclist[[length(bclist) + 1]] <- best_cluster
  }
}
*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 
 

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 7 proposed 2 as the best number of clusters 
* 3 proposed 3 as the best number of clusters 
* 8 proposed 4 as the best number of clusters 
* 3 proposed 5 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  4 
 
 
******************************************************************* 

******************************************************************* 
* Among all indices:                                                
* 7 proposed 2 as the best number of clusters 
* 10 proposed 3 as the best number of clusters 
* 2 proposed 4 as the best number of clusters 
* 2 proposed 5 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  3 
 
 
******************************************************************* 
[1] "Frey index : No clustering structure in this data set"

******************************************************************* 
* Among all indices:                                                
* 8 proposed 2 as the best number of clusters 
* 11 proposed 3 as the best number of clusters 
* 2 proposed 4 as the best number of clusters 
* 2 proposed 5 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  3 
 
 
******************************************************************* 
[1] "Frey index : No clustering structure in this data set"

******************************************************************* 
* Among all indices:                                                
* 9 proposed 2 as the best number of clusters 
* 7 proposed 3 as the best number of clusters 
* 3 proposed 4 as the best number of clusters 
* 4 proposed 5 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  2 
 
 
******************************************************************* 
[1] "Frey index : No clustering structure in this data set"

******************************************************************* 
* Among all indices:                                                
* 8 proposed 2 as the best number of clusters 
* 13 proposed 3 as the best number of clusters 
* 2 proposed 5 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  3 
 
 
******************************************************************* 
[1] "Frey index : No clustering structure in this data set"

******************************************************************* 
* Among all indices:                                                
* 9 proposed 2 as the best number of clusters 
* 6 proposed 3 as the best number of clusters 
* 5 proposed 4 as the best number of clusters 
* 3 proposed 5 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  2 
 
 
******************************************************************* 
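
The six reports above are easier to compare side by side. The sketch below assumes that `fun_nc` prints one report per call and that the reports appear in loop order (methods in the outer loop, distances in the inner loop). The `best_k` values are transcribed from the majority-rule conclusions above, and `runs` is an illustrative name.

```r
# Combinations in loop order: the distance varies fastest (inner loop),
# which matches how expand.grid() cycles its first argument.
runs <- expand.grid(
  distance = c("minkowski", "maximum"),
  method   = c("ward.D", "centroid", "median"),
  stringsAsFactors = FALSE
)

# Majority-rule conclusions, transcribed from the six reports above.
runs$best_k <- c(4, 3, 3, 2, 3, 2)
runs

# How often each cluster count was recommended across the six runs.
table(runs$best_k)
```

Under that ordering assumption, the 4-cluster recommendation comes from the ward.D and minkowski pair, which is the combination used in the clustering step.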

Step 4: Execute the clustering using dfl_famd

# fun_clust(data_frame, distance, agglo_method, clusters)
# Note: despite the "cen" in the variable name, this run uses ward.D linkage.
famd_min_cen_c <- fun_clust(dfl_famd, "minkowski", "ward.D", 4)
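
`fun_clust` is not defined in this notebook. A minimal sketch of an equivalent step using only base R, assuming the helper computes a distance matrix, builds a hierarchical tree, and cuts it into the requested number of groups, could look like the following. The standardized `USArrests` data stands in for `dfl_famd`, and the variable names are illustrative.

```r
# Stand-in for dfl_famd: a numeric matrix of standardized features.
coords <- scale(USArrests)

# Pairwise distances; with the default p = 2, "minkowski" equals Euclidean.
d <- dist(coords, method = "minkowski")

# Agglomerative clustering with the ward.D linkage, then cut the tree into k groups.
tree <- hclust(d, method = "ward.D")
clusters <- cutree(tree, k = 4)

# Cluster sizes, comparable to the table() call in the notebook.
table(clusters)
```

The `hclust` documentation notes that `"ward.D2"`, not `"ward.D"`, implements Ward's criterion when given unsquared distances, so the choice between them may be worth revisiting.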

Evaluate Clusters

table(famd_min_cen_c)
famd_min_cen_c
   1    2    3    4 
1109 1806 1265  411 
# Attach the cluster labels to a working copy of the data
tmp_data <- zdata
tmp_data$cluster <- as.factor(famd_min_cen_c)
p <- ggplot(tmp_data, aes(x=cluster, y=rentZestimate)) + 
  geom_boxplot()
p

q <- ggplot(tmp_data, aes(x=cluster, y=Sch_Rat_Avg)) + 
  geom_boxplot()
q

p <- ggplot(tmp_data, aes(x=cluster, y=Income_per_return)) + 
  geom_boxplot()
p

q <- ggplot(tmp_data, aes(x=cluster, y=violent_crime_total_rate)) + 
  geom_boxplot()
q
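
The boxplots compare clusters visually. A numeric summary per cluster can complement them. The sketch below uses base R `aggregate` on a small made-up data frame. The values in `toy` are invented for illustration and are not taken from the Haystack data.

```r
# Toy stand-in for tmp_data: one numeric feature and a cluster label.
# These numbers are made up for the example.
toy <- data.frame(
  cluster = factor(c(1, 1, 2, 2, 3, 3)),
  rentZestimate = c(1200, 1400, 2100, 1900, 3000, 3400)
)

# Median of the feature within each cluster.
cluster_medians <- aggregate(rentZestimate ~ cluster, data = toy, FUN = median)
cluster_medians
```

The same call could be pointed at `tmp_data` with any of the plotted columns on the left of the formula.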

Step 5: Save the data to a file for exploratory analysis

# Label each row with its cluster, e.g. "Cluster 1"
tmp_data$ClusterCategory <- paste("Cluster", tmp_data$cluster)

write.csv(tmp_data,"cluster_output4.csv", row.names = FALSE)

Remove the temporary dataset

rm(tmp_data)

Extra evaluation using Gower distance and PAM


gower_dist <- cluster::daisy(dfl_famd, metric = "gower")
set.seed(123)
pam_cluster <- cluster::pam(gower_dist, k = 3)

table(pam_cluster$clustering)

   1    2    3 
1720 1593 1278 
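
The notebook fixes k = 3 for PAM. One common way to compare candidate values of k is the average silhouette width that `cluster::pam` reports. The sketch below is an illustration of that check, not the notebook's own method. The numeric columns of `iris` stand in for `dfl_famd`, and `avg_sil` is an illustrative name.

```r
library(cluster)

# Stand-in for dfl_famd: numeric columns only.
coords <- iris[, 1:4]

# Gower dissimilarity; for purely numeric columns this is the
# range-scaled absolute difference averaged over the columns.
gower_dist <- daisy(coords, metric = "gower")

# Partition around medoids for several candidate k and record the
# average silhouette width (higher suggests better-separated clusters).
ks <- 2:5
avg_sil <- sapply(ks, function(k) pam(gower_dist, k = k)$silinfo$avg.width)
data.frame(k = ks, avg_sil = round(avg_sil, 3))
```
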
# Attach the PAM cluster labels to a working copy of the data
tmp_data <- zdata
tmp_data$cluster <- as.factor(pam_cluster$clustering)
p <- ggplot(tmp_data, aes(x=cluster, y=livingArea)) + 
  geom_boxplot()
p

q <- ggplot(tmp_data, aes(x=cluster, y=Sch_Rat_Avg)) + 
  geom_boxplot()
q

p <- ggplot(tmp_data, aes(x=cluster, y=Income_per_return)) + 
  geom_boxplot()
p

q <- ggplot(tmp_data, aes(x=cluster, y=violent_crime_total_rate)) + 
  geom_boxplot()
q

*** End of Part 2 ***
